⚙️ Exercise

1. Fitting NRC rankings

In 2010, the National Research Council released rankings for all doctorate programs in the USA (https://en.wikipedia.org/wiki/United_States_National_Research_Council_rankings). The data was freely available for a short period after release, and afterwards only for a fee; I managed to get a copy during the free period, and that is the data we will use for this exercise. No further set of rankings has been publicly released since then, and I don't know why. Only the rankings and statistics for Statistics programs are included in this data set.

Your job is to answer the question: “How is R Ranking related to rankings on research, student support and diversity?” using the 5th percentile for each of these quantities. Fit your best model, try using splines, and justify your choices.

a.

Read the data. Rename the columns containing the variables of interest to rank, research, student and diversity. Using rsample (part of tidymodels), split the data into 2/3 training and 1/3 test sets.

# Read data
nrc <- read_csv(here::here("data/nrc.csv")) %>%
  mutate(rank = R.Rankings.5th.Percentile,
         research = Research.Activity.5th.Percentile,
         student = Student.Support.Outcomes.5th.Percentile,
         diversity = Diversity.5th.Percentile) %>%
  select(rank, research, student, diversity) 

# Split the data into training and test
set.seed(22)
train_test_split <- initial_split(nrc, prop = 2/3)

nrc_train <- training(train_test_split)
nrc_test <- testing(train_test_split)
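To see what initial_split() is doing, here is a minimal base-R sketch of a 2/3 random split. This is illustrative only: rsample's internals differ slightly, and the 60-row count is made up, not the actual size of nrc.

```r
# Sketch of a 2/3 training / 1/3 test split in base R
set.seed(22)
n <- 60  # hypothetical number of rows
train_idx <- sample(seq_len(n), size = floor(n * 2/3))
test_idx  <- setdiff(seq_len(n), train_idx)

length(train_idx)  # 40
length(test_idx)   # 20
```

The key property is that every row lands in exactly one of the two sets.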

b.

Make response vs predictor plots and predictor vs predictor plots of the training data. Explain the relationships and any outliers, how these would affect the modeling, and what you would expect the results of the modeling to be.

# Plot response vs predictors, with a loess smooth
# (ggduo is from the GGally package)
ggduo(nrc_train, columnsX = c(2,3,4), columnsY = 1)

Rank has a strong relationship with research but very weak with the other two. There is one outlier with good rank but not so good research rank.

c. 

Make a scatterplot matrix of the predictors, and discuss any potential problems like multicollinearity.

# Scatterplot matrix of the predictors
# (ggpairs is from the GGally package)
ggpairs(nrc_train, columns = c(2,3,4))

There is no relationship between any pairs of predictors, and no obvious outliers or clusters. The only small problem is that the distribution of each predictor is slightly right-skewed.
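Multicollinearity, if present, can be quantified with variance inflation factors (VIFs): VIF_j = 1/(1 - R^2_j), where R^2_j comes from regressing predictor j on the remaining predictors. A minimal sketch on simulated predictors (hypothetical data, since the nrc values are not reproduced here):

```r
# VIF by hand: regress one predictor on the others
set.seed(1)
research  <- rnorm(40)
student   <- rnorm(40)
diversity <- rnorm(40)

vif <- function(x, others) {
  r2 <- summary(lm(x ~ ., data = others))$r.squared
  1 / (1 - r2)
}
v_research <- vif(research, data.frame(student, diversity))
v_research  # close to 1 for uncorrelated predictors
```

Values near 1 indicate no multicollinearity; values above about 5 or 10 are usually taken as a warning sign.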

d.

Fit a linear model. Report estimates, model fit statistics, and the test RMSE. Make plots to visually assess the fit, including observed vs fitted, residuals vs fitted, a histogram of residuals, a normal probability plot of residuals, and the fitted values against each predictor.

# Fit a linear model to the original data
lm_mod <- 
  linear_reg() %>% 
  set_engine("lm")

nrc_lm_fit <- 
  lm_mod %>% 
  fit(rank ~ ., data = nrc_train)

# Summarise model and check the fit
tidy(nrc_lm_fit)
# # A tibble: 4 × 5
#   term        estimate std.error statistic    p.value
#   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
# 1 (Intercept)   2.77       4.48      0.619 0.540     
# 2 research      0.536      0.101     5.33  0.00000547
# 3 student       0.201      0.107     1.88  0.0680    
# 4 diversity    -0.0259     0.104    -0.249 0.805
glance(nrc_lm_fit)
# # A tibble: 1 × 12
#   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
#       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
# 1     0.505         0.463  10.7      12.2 0.0000115     3  -149.  309.  317.
# # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# broom::augment() is an easy way to get predicted
# values and residuals
nrc_lm_train_pred <- augment(nrc_lm_fit, nrc_train)
nrc_lm_test_pred <- augment(nrc_lm_fit, nrc_test)
metrics(nrc_lm_test_pred, truth = rank, 
        estimate = .pred)
# # A tibble: 3 × 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      11.6  
# 2 rsq     standard       0.416
# 3 mae     standard       7.11
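The rmse and mae values from yardstick's metrics() are simple functions of the residuals, and can be checked by hand. A tiny example on made-up numbers (not the nrc fit):

```r
# RMSE and MAE from first principles
truth <- c(10, 20, 30, 40)
pred  <- c(12, 18, 33, 39)

rmse_by_hand <- sqrt(mean((truth - pred)^2))
mae_by_hand  <- mean(abs(truth - pred))

rmse_by_hand  # sqrt(4.5), about 2.12
mae_by_hand   # 2
```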

# Plot fitted and residuals of training data
p_f <- ggplot(nrc_lm_train_pred) +
  geom_point(aes(x = .pred, y = rank)) 
p_e <- ggplot(nrc_lm_train_pred) +
  geom_point(aes(x = .pred, y = .resid)) 
p_h <- ggplot(nrc_lm_train_pred, aes(x = .resid)) +
  geom_histogram(binwidth=2.5, colour="white") +
  geom_density(aes(y = after_stat(count)), bw = 2, colour="orange")
p_q <- ggplot(nrc_lm_train_pred, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  xlab("theoretical") + ylab("sample")
p_q + p_e + p_h + p_f 

# Plot the fitted model against each predictor
p1 <- ggplot(nrc_lm_train_pred) +
  geom_point(aes(x = research, y = rank)) +
  geom_point(aes(x = research, y = .pred),
             colour="blue")
p2 <- ggplot(nrc_lm_train_pred) +
  geom_point(aes(x = student, y = rank)) +
  geom_point(aes(x = student, y = .pred),
             colour="blue")
p3 <- ggplot(nrc_lm_train_pred) +
  geom_point(aes(x = diversity, y = rank)) +
  geom_point(aes(x = diversity, y = .pred),
             colour="blue")
p1 + p2 + p3

Training \(R^2\) is 0.505: the model explains about 50.5% of the variation in rank. Only research is statistically significant in the model. The test RMSE is 11.6.

The model fit is fairly weak. There is some moderate heteroskedasticity, with smaller spread at low ranks, and the residuals have long tails, that is, outliers at both small and large values.

The plots against individual predictors show that research is the only variable contributing strongly to the fit, which is supported by the significance tests associated with each parameter estimate.

e.

Fit a splines model. Report estimates, model fit statistics, and the test RMSE. Make plots to visually assess the fit, including observed vs fitted, residuals vs fitted, a histogram of residuals, a normal probability plot of residuals, and the fitted values against each predictor.

nrc_ns_rec <-    
  recipe(rank ~ research + student + diversity, 
         data = nrc_train) %>% 
  step_ns(research, student, diversity, deg_free = 3) 

# Define workflow
nrc_ns_wf <- workflow() %>%
  add_model(lm_mod) %>% 
  add_recipe(nrc_ns_rec)

# Fit a linear model to the transformed data
nrc_ns_fit <- fit(nrc_ns_wf, data=nrc_train)

# Summarise model and check the fit
tidy(nrc_ns_fit)
# # A tibble: 10 × 5
#    term           estimate std.error statistic  p.value
#    <chr>             <dbl>     <dbl>     <dbl>    <dbl>
#  1 (Intercept)      16.9        9.37    1.80   0.0811  
#  2 research_ns_1    11.2        7.46    1.50   0.144   
#  3 research_ns_2    25.5       13.6     1.87   0.0713  
#  4 research_ns_3    28.0        6.41    4.38   0.000134
#  5 student_ns_1     14.2        7.17    1.98   0.0571  
#  6 student_ns_2      3.93      13.0     0.303  0.764   
#  7 student_ns_3      9.22       7.89    1.17   0.252   
#  8 diversity_ns_1    0.241      8.81    0.0274 0.978   
#  9 diversity_ns_2  -22.7       11.3    -2.01   0.0540  
# 10 diversity_ns_3   -2.18       7.67   -0.284  0.778
glance(nrc_ns_fit)
# # A tibble: 1 × 12
#   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
#       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
# 1     0.600         0.480  10.5      5.01 0.000376     9  -145.  312.  331.
# # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# augment() on the workflow fit returns predictions
# but not residuals, so compute residuals manually
nrc_ns_train_pred <- augment(nrc_ns_fit, nrc_train) %>%
  mutate(.resid = rank - .pred)
nrc_ns_test_pred <- augment(nrc_ns_fit, nrc_test) %>%
  mutate(.resid = rank - .pred)
metrics(nrc_ns_test_pred, truth = rank, 
        estimate = .pred)
# # A tibble: 3 × 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      13.8  
# 2 rsq     standard       0.224
# 3 mae     standard       8.84

# Plot fitted and residuals of training data
ps_f <- ggplot(nrc_ns_train_pred) +
  geom_point(aes(x = .pred, y = rank)) 
ps_e <- ggplot(nrc_ns_train_pred) +
  geom_point(aes(x = .pred, y = .resid)) 
ps_h <- ggplot(nrc_ns_train_pred, aes(x = .resid)) +
  geom_histogram(binwidth=2.5, colour="white") +
  geom_density(aes(y = after_stat(count)), bw = 2, colour="orange")
ps_q <- ggplot(nrc_ns_train_pred, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  xlab("theoretical") + ylab("sample")
ps_q + ps_e + ps_h + ps_f

# Plot the fitted model against each predictor
ps1 <- ggplot(nrc_ns_train_pred) +
  geom_point(aes(x = research, y = rank)) +
  geom_point(aes(x = research, y = .pred),
             colour="blue")
ps2 <- ggplot(nrc_ns_train_pred) +
  geom_point(aes(x = student, y = rank)) +
  geom_point(aes(x = student, y = .pred),
             colour="blue")
ps3 <- ggplot(nrc_ns_train_pred) +
  geom_point(aes(x = diversity, y = rank)) +
  geom_point(aes(x = diversity, y = .pred),
             colour="blue")
ps1 + ps2 + ps3

Training \(R^2\) is 0.600: the model explains about 60% of the variation in rank. The test RMSE is 13.8.

The use of splines improves the model fit a very small amount, based on training \(R^2\), but the test RMSE is worse, suggesting the extra flexibility is overfitting the training data.
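step_ns() builds its basis using splines::ns(). With deg_free = 3, each predictor is expanded into three natural-spline basis columns, which is why the model above has 3 x 3 = 9 slope coefficients plus an intercept. A small standalone illustration:

```r
# Natural-spline basis: one predictor becomes deg_free columns
library(splines)
x <- seq(1, 50, length.out = 40)
basis <- ns(x, df = 3)
dim(basis)  # 40 rows, 3 columns
```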

f. 

Fit a generalized additive model (GAM). Report estimates, model fit statistics, and the test RMSE. Make plots to visually assess the fit, including observed vs fitted, residuals vs fitted, a histogram of residuals, a normal probability plot of residuals, and the fitted values against each predictor.

# mgcv::gam is not yet integrated with tidymodels
nrc_gam <- 
  mgcv::gam(rank ~
        s(research) +
        s(student) + 
        s(diversity), data = nrc_train)

# Model fit
tidy(nrc_gam)
# # A tibble: 3 × 5
#   term           edf ref.df statistic    p.value
#   <chr>        <dbl>  <dbl>     <dbl>      <dbl>
# 1 s(research)   1.00   1.00     30.7  0.00000329
# 2 s(student)    1.00   1.00      4.85 0.0344    
# 3 s(diversity)  2.29   2.87      1.26 0.406
glance(nrc_gam)
# # A tibble: 1 × 7
#      df logLik   AIC   BIC deviance df.residual  nobs
#   <dbl>  <dbl> <dbl> <dbl>    <dbl>       <dbl> <int>
# 1  5.29  -147.  307.  317.    3647.        34.7    40

nrc_gam_train_pred <- augment(nrc_gam, nrc_train) %>%
  rename(.pred = .fitted)
nrc_gam_test_pred <- augment(nrc_gam,
                             newdata=nrc_test) %>%
  rename(.pred = .fitted) %>%
  mutate(.resid = rank - .pred)
metrics(nrc_gam_test_pred, truth = rank, 
        estimate = .pred)
# # A tibble: 3 × 3
#   .metric .estimator .estimate
#   <chr>   <chr>          <dbl>
# 1 rmse    standard      12.6  
# 2 rsq     standard       0.326
# 3 mae     standard       7.92

# draw() and appraise() are from the gratia package
draw(nrc_gam, residuals = TRUE)

appraise(nrc_gam)

# Plot the fitted model against each predictor
pg1 <- ggplot(nrc_gam_train_pred) +
  geom_point(aes(x = research, y = rank)) +
  geom_point(aes(x = research, y = .pred),
             colour="blue")
pg2 <- ggplot(nrc_gam_train_pred) +
  geom_point(aes(x = student, y = rank)) +
  geom_point(aes(x = student, y = .pred),
             colour="blue")
pg3 <- ggplot(nrc_gam_train_pred) +
  geom_point(aes(x = diversity, y = rank)) +
  geom_point(aes(x = diversity, y = .pred),
             colour="blue")
pg1 + pg2 + pg3

glance() does not report an \(R^2\) for GAMs. The deviance is 3646.7, which measures what the model does NOT explain. Comparing this to the total sum of squares of the response, 8246.8, indicates that the model explains about 56% of the variation in rank. The test RMSE is 12.6.
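The 56% figure is a deviance-based pseudo-\(R^2\), computed as one minus the ratio of the residual deviance to the total sum of squares of the response:

```r
# Pseudo R^2 for the GAM, from the numbers reported above
deviance_gam <- 3646.7  # from glance(nrc_gam)
tss          <- 8246.8  # total sum of squares of rank in the training set
pseudo_r2 <- 1 - deviance_gam / tss
round(pseudo_r2, 2)  # 0.56
```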

The GAM improves the fit by a very small amount, based on this pseudo-\(R^2\), but it is worse than the linear fit based on test RMSE.

g.

Based on the model fit statistics, test RMSE, and plots, which model is best? How do the predictors relate to rank? Are all the predictors important for predicting rank? Could the model be simplified?

The models perform very similarly, so we would choose the simpler one, the linear regression. For a final model, we would recommend using only research as a predictor.

2. (IF TIME ALLOWS) Lurking variables

a.

The wages data poses an interesting analytical dilemma. There appear to be two wage clusters: a small group with relatively consistent high wages, and a much larger group with lower and more varied wages. Make a histogram or a density plot of wage to check this more carefully.

ggplot(Wage, aes(x=wage)) + geom_density()

b.

What do you think might have caused this? Check the relationship between wage, and other variables such as jobclass. Do any of the other variables provided in the data explain the clustering?

a1 <- ggplot(Wage, aes(x=age, y=logwage, colour=jobclass)) + 
  geom_point(alpha=0.5) +
  scale_colour_brewer("", palette="Dark2") +
  theme(legend.position = "none")
a2 <- ggplot(Wage, aes(x=age, y=logwage, colour=health)) + 
  geom_point(alpha=0.5) +
  scale_colour_brewer("", palette="Dark2") +
  theme(legend.position = "none")
a3 <- ggplot(Wage, aes(x=age, y=logwage, colour=health_ins)) + 
  geom_point(alpha=0.5) +
  scale_colour_brewer("", palette="Dark2") +
  theme(legend.position = "none")
a4 <- ggplot(Wage, aes(x=age, y=logwage, colour=race)) +
  geom_point(alpha=0.5) +
  scale_colour_brewer("", palette="Dark2") +
  theme(legend.position = "none")
a5 <- ggplot(Wage, aes(x=age, y=logwage, colour=maritl)) + 
  geom_point(alpha=0.5) +
  scale_colour_brewer("", palette="Dark2") +
  theme(legend.position = "none")
a1+a2+a3+a4+a5

No other variable in the collected data explains the cluster of salaries above $250,000 (wage > 250; wage is recorded in thousands of dollars). The cause is very likely something simple, such as holding a "manager" position, that was not recorded: a classic "lurking variable".

c. 

The textbook (Section 7.8.1 p.315) includes an analysis that fits a separate logistic regression model with a binary response (\(\leq 250\), \(>250\)). Why doesn't this make sense?

An alternative approach is to treat this as a "lurking variable", let's call it "manager", create a new predictor, and include it in the model. Alternatively, we could treat the high group as outliers and exclude them from the analysis. The argument is that these values are likely due to some unobserved factor, and their presence affects the ability to accurately model the rest of the data.

There is little support for predicting the wage above or below $250 on any of the variables available, which means it would be a weak model, at best.
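The payoff from a lurking-variable predictor can be seen in a small simulation (entirely made-up data, not the Wage set): when the cluster indicator is added to the model, the fit improves dramatically.

```r
# Simulate two wage clusters driven by an unobserved "manager" indicator
set.seed(42)
n <- 200
manager <- rbinom(n, 1, 0.15)              # the lurking variable
age     <- runif(n, 20, 65)
wage    <- 80 + 0.5 * age + 170 * manager + rnorm(n, sd = 15)

r2_without <- summary(lm(wage ~ age))$r.squared
r2_with    <- summary(lm(wage ~ age + manager))$r.squared
c(r2_without, r2_with)  # the indicator explains most of the variation
```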

3. (IF TIME ALLOWS) Explore the polynomial model fitting

a.

This builds from the polynomial model fit for the Wage data, using variables wage and age, in Figure 7.1.

The function poly() is a convenient way to generate a fourth-degree polynomial basis. By default it uses "orthogonal polynomials".

Wage <- as_tibble(Wage)
rec_poly <- recipe(wage ~ age, data = Wage) %>%
  step_poly(age, degree = 4)

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

poly_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_poly)

poly_fit <- fit(poly_wf, data = Wage)
poly_fit
# ══ Workflow [trained] ══════════════════════════════════════════════════════════
# Preprocessor: Recipe
# Model: linear_reg()
# 
# ── Preprocessor ────────────────────────────────────────────────────────────────
# 1 Recipe Step
# 
# • step_poly()
# 
# ── Model ───────────────────────────────────────────────────────────────────────
# 
# Call:
# stats::lm(formula = ..y ~ ., data = data)
# 
# Coefficients:
# (Intercept)   age_poly_1   age_poly_2   age_poly_3   age_poly_4  
#      111.70       447.07      -478.32       125.52       -77.91

tidy(poly_fit)
# # A tibble: 5 × 5
#   term        estimate std.error statistic  p.value
#   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
# 1 (Intercept)    112.      0.729    153.   0       
# 2 age_poly_1     447.     39.9       11.2  1.48e-28
# 3 age_poly_2    -478.     39.9      -12.0  2.36e-32
# 4 age_poly_3     126.     39.9        3.14 1.68e- 3
# 5 age_poly_4     -77.9    39.9       -1.95 5.10e- 2

b.

We can request that "raw" polynomials are generated instead, by passing the raw = TRUE argument through step_poly()'s options.
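What raw = TRUE produces can be verified directly: the columns are the literal powers of x, with no centering or orthogonalization. A quick standalone check:

```r
# Raw polynomial basis is just x, x^2, x^3, ...
x <- as.numeric(1:10)
raw <- poly(x, degree = 3, raw = TRUE)

max(abs(raw[, 1] - x))    # 0
max(abs(raw[, 2] - x^2))  # 0
max(abs(raw[, 3] - x^3))  # 0
```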

rec_raw_poly <- recipe(wage ~ age, data = Wage) %>%
  step_poly(age, degree = 4, options = list(raw = TRUE))

raw_poly_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_raw_poly)

raw_poly_fit <- fit(raw_poly_wf, data = Wage)

tidy(raw_poly_fit)
# # A tibble: 5 × 5
#   term            estimate  std.error statistic  p.value
#   <chr>              <dbl>      <dbl>     <dbl>    <dbl>
# 1 (Intercept) -184.        60.0           -3.07 0.00218 
# 2 age_poly_1    21.2        5.89           3.61 0.000312
# 3 age_poly_2    -0.564      0.206         -2.74 0.00626 
# 4 age_poly_3     0.00681    0.00307        2.22 0.0264  
# 5 age_poly_4    -0.0000320  0.0000164     -1.95 0.0510

c. 

The coefficients are different, but the fit is effectively the same, which can be seen by plotting the fitted values from the two models against each other.

wage_fit <- Wage %>% bind_cols(
   predict(poly_fit, Wage),
   predict(raw_poly_fit, Wage)) %>%
  rename(.pred_poly = .pred...12, 
         .pred_raw_poly = .pred...13) 

ggplot(wage_fit, aes(x=.pred_poly,
                     y=.pred_raw_poly)) + 
  geom_point() + theme(aspect.ratio = 1)
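The same equivalence can be checked in base R with simulated data: the orthogonal and raw bases span the same column space, so the fitted values agree to numerical precision even though the coefficients differ.

```r
# Orthogonal vs raw polynomial bases give identical fitted values
set.seed(1)
x <- runif(100, 18, 80)
y <- 100 + 2 * x - 0.03 * x^2 + rnorm(100, sd = 10)

fit_orth <- lm(y ~ poly(x, 4))
fit_raw  <- lm(y ~ poly(x, 4, raw = TRUE))

max(abs(fitted(fit_orth) - fitted(fit_raw)))  # effectively zero
```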

d. 

To examine the differences between orthonormal polynomials and “raw” polynomials, we can make scatterplot matrices of the two sets of polynomials.

p_orth <- as_tibble(poly(Wage$age, 4))
ggscatmat(p_orth)

p_raw <- as_tibble(poly(Wage$age, 4, raw=TRUE))
ggscatmat(p_raw)

e.

Think about: What is the benefit of using orthogonal polynomials?

As higher-order raw polynomial terms are added, multicollinearity is introduced: \(x\), \(x^2\), \(x^3\), and so on are strongly correlated with each other. Orthogonal polynomials are constructed so that the basis columns are uncorrelated, which removes the linear dependency between terms and makes the fit numerically stable.
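Both claims can be checked numerically: the raw powers of x are strongly correlated with each other, while poly()'s orthogonal columns are uncorrelated by construction.

```r
# Correlation structure of raw vs orthogonal polynomial bases
x <- seq(18, 80, length.out = 100)
raw_cor  <- cor(poly(x, 4, raw = TRUE))
orth_cor <- cor(poly(x, 4))

raw_cor[1, 2]                            # close to 1: x and x^2 nearly collinear
max(abs(orth_cor[upper.tri(orth_cor)]))  # essentially zero
```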

💬 Class discussion exercises, part of the wrap up

Why do you think polynomials and splines are considered to be part of recipes rather than the model fitting?

Both of these are transformations of the predictors, and are best considered to be pre-processing (feature engineering) of the data prior to modeling.
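One concrete consequence: basis transformations have parameters estimated from the training data that must be reused, not re-estimated, when transforming new data. Base R's poly() already supports this via its predict() method, which is the behavior recipes generalizes across all preprocessing steps:

```r
# Basis estimated on training data, then applied to new data
x_train <- as.numeric(1:40)
x_test  <- c(5.5, 20.5, 35.5)

basis_train <- poly(x_train, 3)                        # coefficients estimated here
basis_test  <- predict(basis_train, newdata = x_test)  # reused, not re-fit
dim(basis_test)  # 3 rows, 3 columns
```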